Abstract

This paper discusses a stylized communications problem where one wishes to transmit a 
real-valued signal $x \in \mathbb{R}^n$ (a block of n pieces of information) to a remote receiver. We ask
whether it is possible to transmit this information reliably when a fraction of the transmitted 
codeword is corrupted by arbitrary gross errors, and when in addition, all the entries of the 
codeword are contaminated by smaller errors (e.g. quantization errors). 

We show that if one encodes the information as $Ax$ where $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) is a suitable
coding matrix, there are two decoding schemes that allow the recovery of the block of n
pieces of information x with nearly the same accuracy as if no gross errors occur upon transmission
(or equivalently as if one has an oracle supplying perfect information about the sites
and amplitudes of the gross errors). Moreover, both decoding strategies are very concrete and
only involve solving simple convex optimization programs, either a linear program or a second-order
cone program. We complement our study with numerical simulations showing that the
encoder/decoder pair performs remarkably well.
1 Introduction



This paper discusses a coding problem over the reals. We wish to transmit a block of n real values,
a vector $x \in \mathbb{R}^n$, to a remote receiver. A possible way to address this problem is to communicate
the codeword Ax where A is an m by n coding matrix with m > n. Now a recurrent problem 
with real communication or storage devices is that some portions of the transmitted codeword 
may become corrupted; when this occurs, parts of the received codeword are unreliable and may 
have nothing to do with their original values. We represent this as receiving a distorted codeword 
$y = Ax + z_0$. The question is whether one can recover the signal x from the received data y.
It has recently been shown [5, 6] that one could recover the information x exactly, under suitable
conditions on the coding matrix A, provided that the fraction of corrupted entries of Ax is not
too large. In greater detail, [6] proved that if the corruption $z_0$ contains at most a fixed fraction
of nonzero entries, then the signal $x \in \mathbb{R}^n$ is the unique solution of the minimum-$\ell_1$ approximation
problem

    $$\min_{\tilde x \in \mathbb{R}^n} \|y - A\tilde x\|_{\ell_1}. \tag{1.1}$$

What may appear as a surprise is the fact that this requires no assumption whatsoever about
the corruption pattern $z_0$ except that it must be sparse. In particular, the decoding algorithm is
provably exact even though the entries of $z_0$, and thus of y as well, may for example be
arbitrarily large.
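The minimum-$\ell_1$ decoder (1.1) can be sketched as a small linear program. The block below is only an illustration under stated assumptions: it relies on `scipy.optimize.linprog`, and the problem sizes, the random seed, and the helper name `l1_decode` are hypothetical choices that do not come from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def l1_decode(A, y):
    """Solve min_x ||y - A x||_1 as an LP with slack variables t >= |y - A x|."""
    m, n = A.shape
    # Variables are stacked as (x, t); the objective is sum(t).
    c = np.concatenate([np.zeros(n), np.ones(m)])
    I = np.eye(m)
    # A x - t <= y  and  -A x - t <= -y  together encode |y - A x| <= t.
    A_ub = np.block([[A, -I], [-A, -I]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]

rng = np.random.default_rng(0)
n, m, k = 8, 48, 3                       # illustrative sizes, not from the paper
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
z0 = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z0[support] = 100.0 * rng.standard_normal(k)   # sparse but very large corruption
y = A @ x + z0
x_hat = l1_decode(A, y)
recovery_error = np.linalg.norm(x_hat - x)
```

With so few corrupted entries relative to m, one would expect `recovery_error` to sit at the level of the solver tolerance, whatever the size of the corrupted values.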

While this is interesting, it may not be realistic to assume that except for some gross errors, one is 
able to receive the values of Ax with infinite precision. A better model would assume instead that 
the receiver gets 

    $$y = Ax + \tilde z, \qquad \tilde z = e + z, \tag{1.2}$$

where e is a possibly sparse vector of gross errors and z is a vector of small errors affecting all 
the entries. In other words, one is willing to assume that there are malicious errors affecting a 
fraction of the entries of the transmitted codeword and in addition, smaller errors affecting all 
the entries. For instance, one could think of z as some sort of quantization error which limits the 
precision/resolution of the transmitted information. In this more practical scenario, we ask whether
it is still possible to recover the signal x accurately. The subject of this paper is to show that it is
in fact possible to recover the original signal with nearly the same accuracy as if one had a perfect 
communication system in which no gross errors occur upon transmission. Further, the recovery 
algorithms are especially simple, very concrete and practical; they involve solving very convenient 
convex optimization problems. 

To understand the results of this paper in a more quantitative fashion, suppose that we had a
perfect channel in which no gross errors ever occur; that is, we assume e = 0 in (1.2). Then we
would receive y = Ax + z and would reconstruct x by the method of least squares which, assuming
that A has full rank, takes the form

    $$x^{\text{Ideal}} = (A^*A)^{-1}A^*y. \tag{1.3}$$

In this ideal situation, the reconstruction error would then obey

    $$\|x^{\text{Ideal}} - x\|_{\ell_2} = \|(A^*A)^{-1}A^*z\|_{\ell_2}. \tag{1.4}$$

Suppose we design the coding matrix A with orthonormal columns so that $A^*A = I$. Then we
would obtain a reconstruction error whose maximum size is just about that of z. If the smaller
errors $z_i$ are i.i.d. $N(0, \sigma^2)$, then the mean-squared error (MSE) would obey

    $$\mathbb{E}\,\|x^{\text{Ideal}} - x\|_{\ell_2}^2 = \sigma^2\,\mathrm{Tr}((A^*A)^{-1}).$$

If $A^*A = I$, then the MSE is equal to $n\sigma^2$.
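As a quick numerical illustration of the ideal benchmark, the sketch below compares the empirical MSE of the least-squares estimate (1.3) with the prediction $\sigma^2\,\mathrm{Tr}((A^*A)^{-1})$. The sizes, noise level, seed, and number of trials are arbitrary assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 64, 128, 0.1            # illustrative values
# Orthonormalize a Gaussian matrix so that A^T A = I.
A, _ = np.linalg.qr(rng.standard_normal((m, n)))
x = rng.standard_normal(n)

def ideal_estimate(A, y):
    """Least-squares estimate (A^T A)^{-1} A^T y, as in (1.3)."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

trials = 2000
sq_errors = np.empty(trials)
for i in range(trials):
    z = sigma * rng.standard_normal(m)     # small errors only, e = 0
    sq_errors[i] = np.sum((ideal_estimate(A, A @ x + z) - x) ** 2)

empirical_mse = sq_errors.mean()
# Predicted MSE; reduces to n * sigma^2 because A has orthonormal columns.
predicted_mse = sigma**2 * np.trace(np.linalg.inv(A.T @ A))
```

The two quantities should agree up to Monte Carlo fluctuations.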

The question then is, can one hope to do almost as well as this optimal mean squared error without
knowing e or even the support of e in advance? This paper shows that one can in fact do almost
as well by solving very simple convex programs. This holds for all signals $x \in \mathbb{R}^n$ and all sparse
gross errors, no matter how adversarial.

Two concrete decoding strategies are introduced: one based on second-order cone programming
(SOCP) in Section 2, and another based on linear programming (LP) in Section 3. We will discuss
the differences between the SOCP and the LP decoders, and then compare their empirical
performances in Section 4.



2 Decoding by Second-Order Cone Programming 

To recover the signal x from the corrupted vector y (1.2) we propose solving the following optimization
program:

    $$(P_2) \quad \min \|y - A\tilde x - \tilde z\|_{\ell_1} \quad \text{subject to} \quad \|\tilde z\|_{\ell_2} \le \varepsilon, \quad A^*\tilde z = 0, \tag{2.1}$$

with variables $\tilde x \in \mathbb{R}^n$ and $\tilde z \in \mathbb{R}^m$. The parameter $\varepsilon$ above depends on the magnitude of the small
errors and shall be specified later. The program $(P_2)$ is equivalent to

    $$\min \mathbf{1}^*u \quad \text{subject to} \quad -u \le y - A\tilde x - \tilde z \le u, \quad \|\tilde z\|_{\ell_2} \le \varepsilon, \quad A^*\tilde z = 0, \tag{2.2}$$

where we added the slack optimization variable $u \in \mathbb{R}^m$. In the above formulation, $\mathbf{1}$ is a vector
of ones and the vector inequality $u \le v$ is understood componentwise, i.e., $u_i \le v_i$ for all i. The program
(2.2) is a second-order cone program and as a result, $(P_2)$ can be solved efficiently using standard
optimization algorithms, see [1].

The first key point of this paper is that the SOCP decoder is highly robust against imperfections
in communication channels. Here and below, V denotes the subspace spanned by the columns
of A, and $Q \in \mathbb{R}^{m \times (m-n)}$ is a matrix whose columns form an orthobasis of $V^\perp$, the orthogonal
complement to V. Such a matrix Q is a kind of parity-check matrix since $Q^*A = 0$. Applying $Q^*$
on both sides of (1.2) gives

    $$Q^*y = Q^*e + Q^*z. \tag{2.3}$$

Now if we could somehow get an accurate estimate $\hat e$ of e from $Q^*y$, we could reconstruct x by
applying the method of least squares to the vector y corrected for the gross errors:

    $$\hat x = (A^*A)^{-1}A^*(y - \hat e). \tag{2.4}$$

If $\hat e$ were very accurate, we would probably do very well.
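The parity-check construction can be checked numerically. The following sketch builds A and Q from a single full QR factorization (one possible construction, assumed here for convenience) and verifies (2.3) and the corrected least-squares formula (2.4) in the idealized case where the gross errors are known exactly. All sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 12                                   # illustrative sizes
# Full QR of a Gaussian matrix: the first n columns span V, the rest span its complement.
Q_full, _ = np.linalg.qr(rng.standard_normal((m, n)), mode="complete")
A = Q_full[:, :n]                              # coding matrix with orthonormal columns
Q = Q_full[:, n:]                              # parity-check matrix, m x (m - n)

x = rng.standard_normal(n)
e = np.zeros(m)
e[[1, 7]] = [50.0, -30.0]                      # sparse gross errors
z = 0.01 * rng.standard_normal(m)              # small errors
y = A @ x + e + z

syndrome = Q.T @ y                             # equals Q^T e + Q^T z, as in (2.3)
# If the gross errors were known exactly, (2.4) would return the ideal estimate.
x_corrected = A.T @ (y - e)                    # (A^T A)^{-1} A^T (y - e) with A^T A = I
x_ideal = x + A.T @ z

v = rng.standard_normal(m)
proj_norm = np.linalg.norm(Q @ (Q.T @ v))      # ||P_{V-perp} v||
coef_norm = np.linalg.norm(Q.T @ v)            # ||Q^T v||
```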

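The point is that under suitable conditions, $(P_2)$ provides such accurate estimates. Introduce
$\tilde e = y - A\tilde x - \tilde z$, and observe the following equivalence:

    $$(P_2) \;\Leftrightarrow\; \min \|\tilde e\|_{\ell_1} \ \text{subject to} \ \tilde e = y - A\tilde x - \tilde z,\ A^*\tilde z = 0,\ \|\tilde z\|_{\ell_2} \le \varepsilon$$
    $$\phantom{(P_2)} \;\Leftrightarrow\; (P_2') \quad \min \|\tilde e\|_{\ell_1} \ \text{subject to} \ \|Q^*(y - \tilde e)\|_{\ell_2} \le \varepsilon. \tag{2.5}$$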
We only need to argue about the second equivalence since the first is immediate. Observe that the
condition $A^*\tilde z = 0$ decomposes $y - \tilde e$ as the superposition of an arbitrary element in V (the vector
$A\tilde x$) and of an element in $V^\perp$ (the vector $\tilde z$) whose Euclidean length is at most $\varepsilon$. In other words,
$\tilde z = P_{V^\perp}(y - \tilde e)$ where $P_{V^\perp} = QQ^*$ is the orthonormal projector onto $V^\perp$, so that the problem is
that of minimizing the $\ell_1$ norm of $\tilde e$ under the constraint $\|P_{V^\perp}(y - \tilde e)\|_{\ell_2} \le \varepsilon$. The claim follows
from the identity $\|P_{V^\perp}v\|_{\ell_2} = \|Q^*v\|_{\ell_2}$, which holds for all $v \in \mathbb{R}^m$.

The equivalence between $(P_2)$ and $(P_2')$ asserts that if $(\hat x, \hat z)$ is solution to $(P_2)$, then $\hat e = y - A\hat x - \hat z$
is solution to $(P_2')$ and vice versa; if $\hat e$ is solution to $(P_2')$, then there is a unique way to write $y - \hat e$
as the sum $A\hat x + \hat z$ with $\hat z \in V^\perp$, and the pair $(\hat x, \hat z)$ is solution to $(P_2)$. We note, and this is
important, that the solution $\hat x$ to $(P_2)$ is also given by the corrected least squares formula (2.4).
Equally important is to note that even though we use the matrix Q to explain the rationale behind
the methodology, one should keep in mind that Q does not play any special role in $(P_2)$.

The issue here is that if $\|P_{V^\perp}v\|_{\ell_2}$ is approximately proportional to $\|v\|_{\ell_2}$ for all sparse vectors
$v \in \mathbb{R}^m$, then the solution $\hat e$ to $(P_2')$ is close to e, provided that e is sufficiently sparse [4]. Quantitatively
speaking, if $\varepsilon$ is chosen so that $\|P_{V^\perp}z\|_{\ell_2} \le \varepsilon$, then $\|e - \hat e\|_{\ell_2}$ is less than a numerical constant times
$\varepsilon$; that is, the reconstruction error is within the noise level. The key concept underlying this theory
is the so-called restricted isometry property.

Definition 2.1 Define the isometry constant $\delta_k$ of a matrix $\Phi$ as the smallest number such that

    $$(1 - \delta_k)\|x\|_{\ell_2}^2 \le \|\Phi x\|_{\ell_2}^2 \le (1 + \delta_k)\|x\|_{\ell_2}^2 \tag{2.6}$$

holds for all k-sparse vectors x (a k-sparse vector has at most k nonzero entries).
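For very small matrices, the isometry constant of Definition 2.1 can be computed by brute force, which may help build intuition. The sketch below enumerates all submatrices with k columns; this is exponential in k and only meant as an illustration. The function name `isometry_constant` and the sizes are hypothetical, and the rescaling by $\sqrt{m/n}$ is the one applied to $A^*$ in the proofs of Section 5.2.

```python
import itertools
import numpy as np

def isometry_constant(Phi, k):
    """Brute-force delta_k of (2.6): scan every submatrix with k columns."""
    delta = 0.0
    for cols in itertools.combinations(range(Phi.shape[1]), k):
        s = np.linalg.svd(Phi[:, list(cols)], compute_uv=False)
        # Squared singular values must lie in [1 - delta, 1 + delta].
        delta = max(delta, s[0] ** 2 - 1.0, 1.0 - s[-1] ** 2)
    return delta

rng = np.random.default_rng(3)
m, n = 12, 6                                          # illustrative sizes
A, _ = np.linalg.qr(rng.standard_normal((m, n)))      # orthonormal columns
Phi = np.sqrt(m / n) * A.T                            # rescaled A^*, an n x m matrix
deltas = [isometry_constant(Phi, k) for k in (1, 2, 3)]
```

Since every k-sparse vector is also (k+1)-sparse, the computed constants are nondecreasing in k.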

In the sequel, we shall be concerned with the isometry constants of $A^*$ times a scalar. Since $AA^*$ is
the orthogonal projection $P_V$ onto V, we will thus be interested in subspaces V such that $P_V$ nearly
acts as an isometry on sparse vectors. Our first result states that the SOCP decoder is provably
accurate.

Theorem 2.2 Choose a coding matrix $A \in \mathbb{R}^{m \times n}$ with orthonormal columns spanning V, and let
$(\delta_k)$ be the isometry constants of the rescaled matrix $\sqrt{m/n}\,A^*$. Suppose $\|P_{V^\perp}z\|_{\ell_2} \le \varepsilon$. Then the
solution $\hat x$ to $(P_2)$ obeys

    $$\|\hat x - x\|_{\ell_2} \le C_2 \cdot \frac{\varepsilon}{\sqrt{1 - n/m}} + \|x^{\text{Ideal}} - x\|_{\ell_2} \tag{2.7}$$

for some numerical constant $C_2$ provided that the number k of gross errors obeys
$\delta_{3k} + \frac{1}{2}\delta_{2k} < \frac{1}{2}\left(\frac{m}{n} - 1\right)$; $x^{\text{Ideal}}$ is the ideal solution (1.3) one would get if no gross errors ever occurred (e = 0).

If the (orthonormal) columns of A are selected uniformly at random, then with probability at least
$1 - O(e^{-\gamma(m-n)})$ for some positive constant $\gamma$, the estimate (2.7) holds for $k \asymp \rho \cdot m$, provided
$\rho \le \rho^*(n/m)$, which is a constant depending only on n/m.

This theorem is of significant appeal because it says that the reconstruction error is in some sense
within a constant factor of the ideal solution. Indeed, suppose all we know about z is that $\|z\|_{\ell_2} \le \varepsilon$.
Then $\|x^{\text{Ideal}} - x\|_{\ell_2} = \|A^*z\|_{\ell_2}$ may be as large as $\varepsilon$. Thus for m = 2n, say, (2.7) asserts
that the reconstruction error is bounded by a constant times the ideal reconstruction error. In
addition, if one selects a coding matrix with random orthonormal columns (one way of doing so is
to sample $X \in \mathbb{R}^{m \times n}$ with i.i.d. N(0, 1) entries and orthonormalize the columns by means of the
QR factorization), then one can correct a positive fraction of arbitrarily corrupted entries, in a near
ideal fashion.

Note that in the case where there are no small errors (z = 0), the decoding is exact since $\varepsilon = 0$
and $x^{\text{Ideal}} = x$. Hence, this generalizes earlier results [6]. We would like to emphasize that there is
nothing special about the fact that the columns of A are taken to be orthonormal in Theorem 2.2. 
In fact, one could just as well obtain equivalent statements for general matrices. Our assumption 
only allows us to formulate simple and useful results. 

While the previous result discussed arbitrary small errors, the next is about stochastic errors. 



Corollary 2.3 Suppose the small errors are i.i.d. $N(0, \sigma^2)$ and set $\varepsilon := \sqrt{(m - n)(1 + t)} \cdot \sigma$ for
some fixed t > 0. Then under the same hypotheses about the restricted isometry constants of A and
the number of gross errors as in Theorem 2.2, the solution to $(P_2)$ obeys

    $$\|\hat x - x\|_{\ell_2}^2 \le C_2' \cdot m \cdot \sigma^2, \tag{2.8}$$

for some numerical constant $C_2'$ with probability exceeding $1 - e^{-\gamma^2(m-n)/2} - e^{-m/2}$ where
$\gamma = (\sqrt{1 + 2t} - 1)/\sqrt{2}$. In particular, this last statement holds with overwhelming probability if A is chosen at
random as in Theorem 2.2.

Suppose for instance that m = 2n to make things concrete so that the MSE of the ideal estimate is
equal to $m/2 \cdot \sigma^2$. Then the SOCP reconstruction is within a multiplicative factor $2C_2'$ of the ideal
MSE. Our experiments show that in practice the constant is small: e.g. when m = 2n, one can 
correct 15% of arbitrary errors, and in the overwhelming majority of cases obtain a decoded vector 
whose MSE is less than 3 times larger than the ideal MSE. 



3 Decoding by Linear Programming 

Another way to recover the signal x from the corrupted vector y (1.2) is by linear programming:

    $$(P_\infty) \quad \min \|y - A\tilde x - \tilde z\|_{\ell_1} \quad \text{subject to} \quad \|\tilde z\|_{\ell_\infty} \le \lambda, \quad A^*\tilde z = 0, \tag{3.1}$$

with variables $\tilde x \in \mathbb{R}^n$ and $\tilde z \in \mathbb{R}^m$. As is well known, the program $(P_\infty)$ may also be re-expressed
as a linear program by introducing slack variables just as in $(P_2)$; we omit the standard details.
As with $(P_2)$, the parameter $\lambda$ here is related to the size of the small errors and will be discussed
shortly. In the sequel, we shall also be interested in the more general formulation of $(P_\infty)$

    $$\min \|y - A\tilde x - \tilde z\|_{\ell_1} \quad \text{subject to} \quad |\tilde z_i| \le \lambda_i, \ 1 \le i \le m, \quad A^*\tilde z = 0, \tag{3.2}$$
which gives additional flexibility for adjusting the thresholds $\lambda_1, \lambda_2, \ldots, \lambda_m$ to the noise level.
The same arguments as before prove that $(P_\infty)$ is equivalent to

    $$(P_\infty') \quad \min \|\tilde e\|_{\ell_1} \quad \text{subject to} \quad \|QQ^*(y - \tilde e)\|_{\ell_\infty} \le \lambda, \tag{3.3}$$

where we recall that $P_{V^\perp} = QQ^*$ is the orthonormal projector onto $V^\perp$ (V is the column space of
A); that is, if $\hat e$ is solution to $(P_\infty')$, then there is a unique decomposition $y - \hat e = A\hat x + \hat z$ where
$A^*\hat z = 0$ and $(\hat x, \hat z)$ is solution to $(P_\infty)$. The converse is also true. Similarly, the more general
program (3.2) is equivalent to minimizing the $\ell_1$ norm of $\tilde e$ under the constraint $|P_{V^\perp}(y - \tilde e)|_i \le \lambda_i$,
$1 \le i \le m$.
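A minimal sketch of the LP decoder, written directly from (3.1)-(3.2) with slack variables, is given below. It assumes `scipy.optimize.linprog` as the solver; the function name `lp_decode`, the problem sizes, and the oracle-style choice of the threshold (taken equal to the largest entry of $P_{V^\perp}z$ so that the true e is feasible) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lp_decode(A, y, lam):
    """Solve min ||y - A x - z||_1  s.t.  |z_i| <= lam_i,  A^T z = 0."""
    m, n = A.shape
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (m,))
    I = np.eye(m)
    # Variables are stacked as (x, z, u), where u bounds |y - A x - z| componentwise.
    c = np.concatenate([np.zeros(n + m), np.ones(m)])
    A_ub = np.block([[-A, -I, -I], [A, I, -I]])
    b_ub = np.concatenate([-y, y])
    A_eq = np.hstack([np.zeros((n, n)), A.T, np.zeros((n, m))])
    b_eq = np.zeros(n)
    bounds = ([(None, None)] * n
              + [(-float(l), float(l)) for l in lam]
              + [(0, None)] * m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n:n + m], res.fun

rng = np.random.default_rng(4)
n, m, k, sigma = 8, 32, 2, 0.01                # illustrative values
A, _ = np.linalg.qr(rng.standard_normal((m, n)))
x = rng.standard_normal(n)
e = np.zeros(m)
e[rng.choice(m, size=k, replace=False)] = 10.0
z = sigma * rng.standard_normal(m)
y = A @ x + e + z
z_perp = z - A @ (A.T @ z)                     # projection of z onto the complement of V
lam = np.max(np.abs(z_perp))                   # oracle threshold keeping the true e feasible
x_hat, z_hat, objective = lp_decode(A, y, lam)
```

Because the pair $(x + A^*z, P_{V^\perp}z)$ is feasible with residual e, the optimal objective cannot exceed $\|e\|_{\ell_1}$, which gives a simple sanity check on the solver output.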

In statistics, the estimator $\hat e$ solution to $(P_\infty')$ is known as the Dantzig selector [7]. It was originally
introduced to estimate the vector e from the data y' and the model

    $$y' = Q^*e + z', \tag{3.4}$$

where z' is a vector of stochastic errors, e.g. independent mean-zero Gaussian random variables.
The connection with our problem is clear since applying the parity-check matrix $Q^*$ on both sides
of (1.2) gives

    $$Q^*y = Q^*e + Q^*z$$

as before. If z is stochastic noise, we can use the Dantzig selector to recover e from Q*y. Moreover, 
available statistical theory asserts that if Q* obeys nice restricted isometry properties and e is 
sufficiently sparse just as before, then this estimation procedure is extremely accurate and in some 
sense optimal. 

It remains to discuss how one should specify the parameter $\lambda$ in (3.1)-(3.3), which is easy. Suppose
the small errors are stochastic. Then we fix $\lambda$ so that the true vector e is feasible for $(P_\infty')$ with
very high probability; i.e. we adjust $\lambda$ so that

    $$\|P_{V^\perp}(y - e)\|_{\ell_\infty} = \|P_{V^\perp}z\|_{\ell_\infty} \le \lambda$$

with high probability. In the more general formulation, the thresholds are adjusted so that
$\sup_{1 \le i \le m} |P_{V^\perp}z|_i/\lambda_i \le 1$ with high probability.

The main result of this section is that the LP decoder is also provably accurate. 

Theorem 3.1 Choose a coding matrix $A \in \mathbb{R}^{m \times n}$ with orthonormal columns spanning V, and let
$(\delta_k)$ be the isometry constants of the rescaled matrix $\sqrt{m/n}\,A^*$. Suppose $\|P_{V^\perp}z\|_{\ell_\infty} \le \lambda$. Then the
solution $\hat x$ to $(P_\infty)$ obeys

    $$\|\hat x - x\|_{\ell_2} \le C_1 \sqrt{k} \cdot \frac{\lambda}{1 - n/m} + \|x^{\text{Ideal}} - x\|_{\ell_2} \tag{3.5}$$

for some numerical constant $C_1$ provided that the number k of gross errors obeys $\delta_{3k} + \delta_{2k} < \frac{m}{n} - 1$;
$x^{\text{Ideal}}$ is the ideal solution (1.3) one would get if no gross errors ever occurred.

If the (orthonormal) columns of A are selected uniformly at random, then with probability at least
$1 - O(e^{-\gamma(m-n)})$ for some positive constant $\gamma$, the estimate (3.5) holds for $k \asymp \rho \cdot m$, provided
$\rho \le \rho^*(n/m)$.
In effect, the LP decoder efficiently corrects a positive fraction of arbitrarily corrupted entries.
Again, when there are no small errors (z = 0), the decoding is exact. (Also and just as before,
there is nothing special about the fact that the columns of A are taken to be orthonormal.) We
now consider the interesting case in which the small errors are stochastic. Below, we conveniently
adjust the thresholds $\lambda_i$ so that the true vector e is feasible with high probability, see Section 5.3
for details.

Corollary 3.2 Choose a coding matrix A with (orthonormal) columns selected uniformly at random
and suppose the small errors are i.i.d. $N(0, \sigma^2)$. Fix [...]

    $$\|\hat x - x\|_{\ell_2}^2 \le [1 + C_1' s]^2 \cdot \|x^{\text{Ideal}} - x\|_{\ell_2}^2, \qquad s^2 = [\ldots] \tag{3.6}$$

with very large probability, where $C_1'$ is some numerical constant. In effect, $\|\hat x - x\|_{\ell_2}^2$ is bounded by
just about $[1 + C_1' s]^2 \cdot n\sigma^2$ since $\|x^{\text{Ideal}} - x\|_{\ell_2}^2$ is distributed as $\sigma^2$ times a chi-square with n degrees
of freedom, and is tightly concentrated around $n\sigma^2$.

Recall that the MSE is equal to $n\sigma^2$ when there are no gross errors and, therefore, this last result
asserts that the reconstruction error is bounded by a constant times the ideal reconstruction error.
Suppose for instance that m = 2n. Then $s^2 = 4k(\log m)/m$ and we see that s is small when there
are few gross errors. In this case, the recovery error is very close to that attained by the ideal
procedure. Our experiments show that in practice, the constant $C_1'$ is quite small: for instance,
when m = 2n, one can correct 15% of arbitrary errors, and in the overwhelming majority of cases
obtain a decoded vector whose MSE is less than 3 times larger than the ideal MSE.

Finally, this last result is in some way more subtle than the corresponding result for the SOCP
decoder. Indeed, (3.6) asserts that the accuracy of the LP decoder automatically adapts to the
number k of gross errors which were introduced. The smaller this number, the smaller the recovery
error. For small values of k, the bound in (3.6) may in fact be considerably smaller than its analog
(2.8).



4 Numerical Experiments

As mentioned earlier, numerical studies show that the empirical performance of the proposed decoding
strategies is remarkable. To confirm these findings, this section discusses an experimental
setup and presents numerical results. The reader wanting to reproduce our results may find the
matlab file available at http://www.acm.caltech.edu/~emmanuel/ConvexDecode.m useful. Here
are the steps we used:

1. Choose a pair (n, m) and sample an m by n matrix A with independent standard normal
entries; the coding matrix is fixed throughout.

2. Choose a fraction $\rho$ of grossly corrupted entries and define the number of corrupted entries
as $k = \mathrm{round}(\rho \cdot m)$; e.g. if m = 512 and 10% of the entries are corrupted, k = 51.

3. Sample a block of information $x \in \mathbb{R}^n$ with independent and identically distributed Gaussian
entries. Compute Ax.

4. Select k locations uniformly at random and flip the signs of Ax at these locations.

5. Sample the vector $z = (z_1, \ldots, z_m)$ of smaller errors with $z_i$ i.i.d. $N(0, \sigma^2)$, and add z to the
outcome of the previous step. Obtain y.

6. Obtain $\hat x$ by solving both $(P_2)$ and $(P_\infty)$ followed by a reprojection step discussed below [7].

7. Repeat steps (3)-(6) 500 times. 
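
The data-generation part of these steps (1 through 5) can be sketched as follows. This is not the authors' MATLAB code; the function name `simulate_received`, the seed, and the value of the noise level are hypothetical, while m = 2n = 512 and the 10% corruption rate match the setup described in this section.

```python
import numpy as np

def simulate_received(A, rho, sigma, rng):
    """Steps 2-5: draw x, flip the signs of k entries of A x, then add small errors."""
    m, n = A.shape
    k = int(round(rho * m))
    x = rng.standard_normal(n)
    codeword = A @ x
    flipped = rng.choice(m, size=k, replace=False)
    corrupted = codeword.copy()
    corrupted[flipped] *= -1.0
    e = corrupted - codeword          # gross errors: -2 (A x)_i on the flipped set
    z = sigma * rng.standard_normal(m)
    return x, e, z, corrupted + z

rng = np.random.default_rng(5)
n, m = 256, 512
A = rng.standard_normal((m, n))       # step 1, fixed throughout
x, e, z, y = simulate_received(A, rho=0.10, sigma=0.05, rng=rng)
```

A sign flip corresponds to a gross error of magnitude twice the codeword entry, so the received vector still has the form y = Ax + e + z of (1.2).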

We briefly discuss the reprojection step. As observed in [7], both programs $(P_2')$ and $(P_\infty')$ have
a tendency to underestimate the vector e (they tend to be akin to soft-thresholding procedures).
One can easily correct for this bias as follows: 1) solve $(P_2')$ or $(P_\infty')$ and obtain $\hat e$; 2) estimate
the support of the gross errors e via $I := \{i : |\hat e_i| > \sigma\}$, where $\sigma$ is the standard deviation of the
smaller errors; recall that $y' := Q^*y = Q^*e + Q^*z$ and update the estimate by regressing y' onto
the selected columns of $Q^*$ via the method of least squares

    $$\hat e = \operatorname{argmin} \|y' - Q^*\tilde e\|_{\ell_2}^2 \quad \text{subject to} \quad \tilde e_i = 0, \ i \in I^c;$$

3) finally, obtain $\hat x$ via $(A^*A)^{-1}A^*(y - \hat e)$ where $\hat e$ is the reprojected estimate calculated in the
previous step.
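
The reprojection step can be sketched in a few lines. In the example below the decoder output is replaced by a deliberately shrunken copy of e, a stand-in assumption that mimics the underestimation described above; the function name `reproject` and all numerical values are hypothetical.

```python
import numpy as np

def reproject(A, Q, y, e_hat, sigma):
    """Steps 2)-3): re-estimate e by least squares on its detected support, then decode x."""
    support = np.flatnonzero(np.abs(e_hat) > sigma)        # I = {i : |e_hat_i| > sigma}
    y_prime = Q.T @ y
    coeffs = np.linalg.lstsq(Q.T[:, support], y_prime, rcond=None)[0]
    e_new = np.zeros_like(e_hat)
    e_new[support] = coeffs
    x_hat = np.linalg.lstsq(A, y - e_new, rcond=None)[0]   # (A^T A)^{-1} A^T (y - e_new)
    return e_new, x_hat

rng = np.random.default_rng(6)
n, m = 6, 16                                               # illustrative sizes
Q_full, _ = np.linalg.qr(rng.standard_normal((m, n)), mode="complete")
A, Q = Q_full[:, :n], Q_full[:, n:]
x = rng.standard_normal(n)
e = np.zeros(m)
e[[2, 9, 13]] = [5.0, -7.0, 6.0]
y = A @ x + e                                              # noiseless case, z = 0
e_shrunk = 0.6 * e                                         # stand-in for a biased decoder output
e_new, x_hat = reproject(A, Q, y, e_shrunk, sigma=0.1)
```

In this noiseless example the support is detected correctly, so the regression removes the shrinkage and recovers both e and x.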

In our series of experiments, we used m = 2n = 512 and a corruption rate of 10%. The standard
deviation $\sigma$ is selected in such a way that just about the first three binary digits of each entry of the
codeword Ax are reliable. Formally $\sigma = \mathrm{median}(|Ax|)/16$. Finally and to be complete, we set the
threshold $\varepsilon$ in $(P_2)$ so that $\|Q^*z\|_{\ell_2} \le \varepsilon$ with probability .95; in other words, $\varepsilon^2 = \chi^2_{m-n}(.95) \cdot \sigma^2$,
where $\chi^2_{m-n}(.95)$ is the 95th percentile of a chi-squared distribution with m - n degrees of freedom.
We also set the thresholds in the general formulation (3.2) of $(P_\infty)$ in a similar fashion. The
distribution of $(QQ^*z)_i$ is normal with mean 0 and variance $s_i^2 = (QQ^*)_{i,i} \cdot \sigma^2$ so that the variable
$z_i' = (QQ^*z)_i/s_i$ is standard normal. We choose $\lambda_i = \lambda \cdot s_i$ where $\lambda$ obeys

    $$\sup_{1 \le i \le m} |z_i'| \le \lambda$$

with probability at least .95. In both cases, our selection makes the true vector e of gross errors
feasible with probability at least .95. In our simulations, the thresholds for the SOCP and LP
decoders (the parameters $\chi^2_{m-n}(.95)$ and $\lambda$) were computed by Monte Carlo simulations.
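
One way to carry out such a Monte Carlo calibration is sketched below. The function name `monte_carlo_thresholds`, the number of trials, the seed, and the sizes are assumptions made for the illustration; the projector onto $V^\perp$ is formed from an orthonormal basis of the column space of A so that A itself need not have orthonormal columns.

```python
import numpy as np

def monte_carlo_thresholds(A, sigma, trials=2000, level=0.95, seed=0):
    """Estimate eps for (P2) and the lambda_i of (3.2) by simulating z ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U, _ = np.linalg.qr(A)                         # orthonormal basis of V
    Z = sigma * rng.standard_normal((trials, m))
    Z_perp = Z - (Z @ U) @ U.T                     # each row is Q Q^* z
    eps = np.quantile(np.linalg.norm(Z_perp, axis=1), level)
    # s_i^2 = (Q Q^*)_{ii} sigma^2, and (Q Q^*)_{ii} = 1 - ||i-th row of U||^2.
    s = sigma * np.sqrt(1.0 - np.sum(U**2, axis=1))
    lam = np.quantile(np.max(np.abs(Z_perp) / s, axis=1), level)
    return eps, lam * s

rng = np.random.default_rng(7)
m, n, sigma = 128, 64, 1.0                         # illustrative values
A = rng.standard_normal((m, n))
eps, lam = monte_carlo_thresholds(A, sigma)
```

With $\sigma = 1$, the squared threshold `eps**2` should be close to the 95th percentile of a chi-squared variable with m - n degrees of freedom.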

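To evaluate the accuracy of the decoders, we report two statistics

    $$\rho^{\text{Ideal}} = \frac{\|\hat x - x\|_{\ell_2}}{\|x^{\text{Ideal}} - x\|_{\ell_2}} \quad \text{and} \quad \rho^{\text{Oracle}} = \frac{\|\hat x - x\|_{\ell_2}}{\|x^{\text{Oracle}} - x\|_{\ell_2}}, \tag{4.1}$$

which compare the performance of our decoders with that of ideal strategies which assume either
exact knowledge of the gross errors or exact knowledge of their locations. As discussed earlier,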
[Figure 1 panels (first row): Ideal LS ratio for the SOCP decoder; Ideal oracle ratio for the SOCP decoder]

Figure 1: Statistics of the ratios (4.1) $\rho^{\text{Ideal}}$ (first column) and $\rho^{\text{Oracle}}$ (second column) which
compare the performance of the proposed decoders with that of ideal strategies which assume
either exact knowledge of the gross errors or exact knowledge of their locations. The first row
shows the performance of the SOCP decoder, the second that of the LP decoder.



$x^{\text{Ideal}}$ is the reconstructed vector one would obtain if the gross errors were known to the receiver
exactly (which is of course equivalent to having no gross errors at all). The reconstruction $x^{\text{Oracle}}$ is
that one would obtain if, instead, one had available an oracle supplying perfect information about
the location of the gross errors (but not their value). Then one could simply delete the corrupted
entries of the received codeword y and reconstruct x by the method of least squares, i.e. find the
minimizer $\tilde x$ of $\|y^{\text{Oracle}} - A^{\text{Oracle}}\tilde x\|_{\ell_2}$, where $A^{\text{Oracle}}$ (resp. $y^{\text{Oracle}}$) is obtained from A (resp. y) by
deleting the corrupted rows.
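
The two benchmark reconstructions and the ratios (4.1) can be sketched as follows. The helper names `oracle_estimate` and `ratios`, the sizes, and the error values are hypothetical; in the demonstration the oracle estimate itself is passed in place of a decoder output, purely as a stand-in.

```python
import numpy as np

def oracle_estimate(A, y, corrupted):
    """Least squares on the rows of (A, y) that the oracle marks as uncorrupted."""
    keep = np.setdiff1d(np.arange(A.shape[0]), corrupted)
    return np.linalg.lstsq(A[keep], y[keep], rcond=None)[0]

def ratios(x, x_hat, x_ideal, x_oracle):
    """The two statistics of (4.1)."""
    err = np.linalg.norm(x_hat - x)
    return err / np.linalg.norm(x_ideal - x), err / np.linalg.norm(x_oracle - x)

rng = np.random.default_rng(8)
m, n = 40, 10                                        # illustrative sizes
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
corrupted = np.array([3, 11, 25, 30])
e = np.zeros(m)
e[corrupted] = 20.0
z = 0.01 * rng.standard_normal(m)
y = A @ x + e + z
x_ideal = np.linalg.lstsq(A, y - e, rcond=None)[0]   # gross errors known exactly
x_oracle = oracle_estimate(A, y, corrupted)          # only their locations known
rho_ideal, rho_oracle = ratios(x, x_oracle, x_ideal, x_oracle)
```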

The results are presented in Figure 1 and summarized in Table 1. These results show that both our 
approaches work extremely well. As one can see, our methods give reconstruction errors which are 
nearly as sharp as if no gross errors had occurred or as if one knew the locations of these large errors 
exactly. Put in a different way, the constants appearing in our quantitative bounds are in practice 
very small. Finally, the SOCP and LP decoders have about the same performance although upon 
closer inspection, one could argue that the LP decoder is perhaps a tiny bit more accurate. 
                 median of ρ^Ideal   mean of ρ^Ideal   median of ρ^Oracle   mean of ρ^Oracle
SOCP decoder     1.386               1.401             1.241                1.253
LP decoder       1.346               1.386             1.212                1.239

Table 1: Summary statistics of the ratios $\rho^{\text{Ideal}}$ and $\rho^{\text{Oracle}}$ (4.1) for the Gaussian coding matrix.




Figure 2: Statistics of the ratios $\rho^{\text{Oracle}}$ for the SOCP decoder (first column) and the LP decoder
(second column) in the case where the coding matrix is a partial Fourier transform.

We also repeated the same experiment but with a coding matrix A consisting of n = 256 randomly 
sampled columns of the 512 x 512 discrete Fourier transform, and obtained very similar results. 
The results are presented in Figure 2 and summarized in Table 2. The numbers are remarkably 
close to our earlier findings and again both our methods work extremely well (again the LP decoder 
is a tiny bit more accurate). This experiment is of special interest since it suggests that one can 
apply our decoding algorithms to very large data vectors, e.g. with sizes ranging in the hundreds
of thousands. The reason is that one can use off-the-shelf interior point algorithms which only
need to be able to apply A or A* to arbitrary vectors (and never need to manipulate the entries 
of A or even store them). When A is a partial Fourier transform, one can evaluate Ax and A*y 
by means of the FFT and, hence, this is well suited for very large problems. See [2] for very large 
scale experiments of a similar flavor. 





                 median of ρ^Ideal   mean of ρ^Ideal   median of ρ^Oracle   mean of ρ^Oracle
SOCP decoder     1.390               1.401             1.244                1.262
LP decoder       1.337               1.375             1.195                1.230

Table 2: Summary statistics of the ratios $\rho^{\text{Ideal}}$ and $\rho^{\text{Oracle}}$ (4.1) for the Fourier coding matrix.
5 Proofs



In this section, we prove all of our results. We begin with some preliminaries which will be used 
throughout, then prove the claims about the SOCP decoder, and end this section with the LP 
decoder. Our work builds on [4] and [7]. 

5.1 Preliminaries 

We shall make extensive use of two simple lemmas that we now record. 

Lemma 5.1 Let $Y_d \sim \chi^2_d$ be distributed as a chi-squared random variable with d degrees of freedom.
Then for each t > 0

    $$P(Y_d - d \ge t\sqrt{2d} + t^2) \le e^{-t^2/2} \quad \text{and} \quad P(Y_d - d \le -t\sqrt{2d}) \le e^{-t^2/2}. \tag{5.1}$$

This is fairly standard [17], see also [16] for very slightly refined estimates. A consequence of these 
large deviation bounds is the estimate below. 
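
As a sanity check, the tail bounds (5.1) can be compared with empirical frequencies. The sketch below is only a simulation under arbitrary choices of d, t, sample size, and seed; the function name `chi2_tail_check` is hypothetical.

```python
import numpy as np

def chi2_tail_check(d, t, trials=200_000, seed=0):
    """Compare empirical chi-squared tail frequencies with the bound in (5.1)."""
    rng = np.random.default_rng(seed)
    Y = rng.chisquare(d, size=trials)
    upper = np.mean(Y - d >= t * np.sqrt(2 * d) + t**2)   # upper-tail frequency
    lower = np.mean(Y - d <= -t * np.sqrt(2 * d))         # lower-tail frequency
    return upper, lower, np.exp(-t**2 / 2)

upper_freq, lower_freq, bound = chi2_tail_check(d=256, t=1.5)
```

Both empirical frequencies should fall below the common bound $e^{-t^2/2}$.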

Lemma 5.2 Let $u = (u_1, u_2, \ldots, u_m)$ be a vector uniformly distributed on the unit sphere in m
dimensions. Let $Z_n = u_1^2 + \ldots + u_n^2$ be the squared length of the first n components of u. Then for
each t < 1/2

    $$P\left(Z_n \le \frac{n}{m}(1 - t)\right) \le e^{-nt^2/16} + e^{-mt^2/24}. \tag{5.2}$$

Proof Suppose $X_1, X_2, \ldots, X_m$ are i.i.d. N(0, 1). Then the distribution of u is that of the vector
$X/\|X\|_{\ell_2}$ and, therefore, the law of $Z_n$ is that of $Y_n/Y_m$, where $Y_k = \sum_{j \le k} X_j^2$. Define the events
$A_\varepsilon = \{Y_m > (1 + \varepsilon)m\}$ for some $\varepsilon > 0$ and $B_t = \{Y_n/Y_m < (n/m)(1 - t)\}$. We have

    $$P(B_t) = P(B_t \mid A_\varepsilon^c)\,P(A_\varepsilon^c) + P(B_t \mid A_\varepsilon)\,P(A_\varepsilon)$$
    $$\le P(Y_n < n(1 - t)(1 + \varepsilon)) + P(Y_m > (1 + \varepsilon)m).$$

With $\varepsilon = t/2$, this gives $n(1 - t)(1 + \varepsilon) \le n(1 - t/2)$ and thus

    $$P(Z_n < (n/m)(1 - t)) \le P(Y_n < n(1 - t/2)) + P(Y_m > (1 + \varepsilon)m)$$
    $$\le e^{-nt^2/16} + e^{-\gamma^2 m/2},$$

which follows from (5.1), where $\gamma$ obeys $t/2 = \gamma\sqrt{2} + \gamma^2$. For t < 1/2, $\gamma \ge t/(2\sqrt{3})$ (for small values
of t, $\gamma \approx t/(2\sqrt{2})$) and the conclusion follows. ■

5.2 The SOCP decoder 

For a matrix $\Phi$, define the sequences $(a_k)$ and $(b_k)$ as respectively the largest and smallest numbers
obeying

    $$a_k\|x\|_{\ell_2} \le \|\Phi x\|_{\ell_2} \le b_k\|x\|_{\ell_2} \tag{5.3}$$

for all k-sparse vectors. In other words, if we list all the singular values of all the submatrices of
$\Phi$ with k columns, $a_k$ is the smallest element from that list and $b_k$ the largest. Note of course the
resemblance with (2.6); only this is slightly more general. We now adapt an important result from
[4].

Lemma 5.3 (adapted from [4]) Set $\Phi \in \mathbb{R}^{r \times m}$ and let $(a_k)$ and $(b_k)$ be the restricted extremal
singular values of $\Phi$ as in (5.3). Any point $\tilde x \in \mathbb{R}^m$ obeying

    $$\|\tilde x\|_{\ell_1} \le \|x\|_{\ell_1} \quad \text{and} \quad \|\Phi\tilde x - \Phi x\|_{\ell_2} \le 2\varepsilon, \tag{5.4}$$

also obeys

    $$\|\tilde x - x\|_{\ell_2} \le \frac{\sqrt{6}\,\varepsilon}{a_{3k}(\Phi) - \frac{1}{\sqrt{2}}b_{2k}(\Phi)}, \tag{5.5}$$

provided that x is k-sparse with k such that $a_{3k}(\Phi) - \frac{1}{\sqrt{2}}b_{2k}(\Phi) > 0$.

The proof follows the same steps as that of Theorem 1.1 in [4], and is omitted. In particular, it
follows from (2.6) in the aforementioned reference with $M = 2|T_0|$ and $a_{M+|T_0|}$ (resp. $b_M$) in place
of $\sqrt{1 - \delta_{|T_0|+M}}$ (resp. $\sqrt{1 + \delta_M}$) in the definition of $C_{|T_0|,M}$.

5.2.1 Proof of Theorem 2.2 

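Recall that the solution $(\hat x, \hat z)$ to $(P_2)$ obeys (2.4) where $\hat e$ is the solution to $(P_2')$. Replacing y in
(2.4) with Ax + e + z gives

    $$\hat x - x = (A^*A)^{-1}A^*(e - \hat e) + (A^*A)^{-1}A^*z$$
    $$\phantom{\hat x - x} = (A^*A)^{-1}A^*(e - \hat e) + x^{\text{Ideal}} - x, \tag{5.6}$$

and since $A^*A = I$,

    $$\|\hat x - x\|_{\ell_2} \le \|A^*(e - \hat e)\|_{\ell_2} + \|x^{\text{Ideal}} - x\|_{\ell_2}.$$

To prove (2.7), it then suffices to show that $\|e - \hat e\|_{\ell_2} \le C_2 \cdot \varepsilon/\sqrt{1 - n/m}$ since the 2-norm of $A^*$ is at most 1.

By assumption $\|Q^*(y - e)\|_{\ell_2} = \|Q^*z\|_{\ell_2} \le \varepsilon$ and thus, e is feasible for $(P_2')$ which implies
$\|\hat e\|_{\ell_1} \le \|e\|_{\ell_1}$. Moreover,

    $$\|Q^*\hat e - Q^*e\|_{\ell_2} \le \|Q^*(y - \hat e)\|_{\ell_2} + \|Q^*(y - e)\|_{\ell_2} \le 2\varepsilon.$$

We then apply Lemma 5.3 (with $\Phi = Q^*$) and obtain

    $$\|\hat e - e\|_{\ell_2} \le \frac{\sqrt{6}\,\varepsilon}{a_{3k}(Q^*) - \frac{1}{\sqrt{2}}b_{2k}(Q^*)}. \tag{5.7}$$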

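Now since the m x m matrix obtained by concatenating the columns of A and Q is an isometry,
we have

    $$\|A^*x\|_{\ell_2}^2 + \|Q^*x\|_{\ell_2}^2 = \|x\|_{\ell_2}^2 \quad \forall x \in \mathbb{R}^m,$$

whence

    $$a_k^2(Q^*) = 1 - b_k^2(A^*), \qquad b_k^2(Q^*) = 1 - a_k^2(A^*).$$

Assuming that $a_{3k}(Q^*) > \frac{1}{\sqrt{2}}b_{2k}(Q^*)$, we deduce from (5.7) that

    $$\|\hat e - e\|_{\ell_2} \le 2\sqrt{6}\,\varepsilon \cdot \frac{a_{3k}(Q^*)}{a_{3k}^2(Q^*) - \frac{1}{2}b_{2k}^2(Q^*)}. \tag{5.8}$$

Recall that $(\delta_k)$ are the restricted isometry constants of $\sqrt{m/n}\,A^*$, and observe that by definition for
each k = 1, 2, ...,

    $$a_k^2(A^*) \ge \frac{n}{m}(1 - \delta_k), \qquad b_k^2(A^*) \le \frac{n}{m}(1 + \delta_k).$$

It follows that the denominator on the right-hand side of (5.8) is greater or equal to

    $$\frac{1}{2} + \frac{n}{2m}(1 - \delta_{2k}) - \frac{n}{m}(1 + \delta_{3k}) = \frac{1}{2}\left(1 - \frac{n}{m}\right) - \frac{n}{m}\left(\delta_{3k} + \frac{1}{2}\delta_{2k}\right).$$

Now suppose that for some 0 < c < 1,

    $$\delta_{3k} + \frac{1}{2}\delta_{2k} \le \frac{c}{2}\left(\frac{m}{n} - 1\right).$$

This automatically implies $a_{3k}(Q^*) > \frac{1}{\sqrt{2}}b_{2k}(Q^*)$, and the denominator on the right-hand side of
(5.8) is greater or equal to $\frac{1}{2}(1 - c)(1 - \frac{n}{m})$. The numerator obeys

    $$a_{3k}^2(Q^*) = 1 - b_{3k}^2(A^*) \le 1 - a_{3k}^2(A^*) \le 1 - (1 - \delta_{3k})\frac{n}{m}.$$

Since $\frac{n}{m}\delta_{3k} \le \frac{c}{2}(1 - \frac{n}{m})$, we also have $a_{3k}^2(Q^*) \le (1 + \frac{c}{2})(1 - \frac{n}{m})$. In summary, (5.8) gives

    $$\|\hat e - e\|_{\ell_2} \le C_2 \cdot \frac{\varepsilon}{\sqrt{1 - n/m}},$$

where one can take $C_2$ as $4\sqrt{6(1 + c/2)}/(1 - c)$. This establishes the first part of the claim.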

We now turn to the second part of the theorem and argue that if the orthonormal columns of A are
chosen uniformly at random, the error bound (2.7) is valid as long as we have a constant fraction
of gross errors. Put r = m - n and let X be an m by r matrix with independent Gaussian entries
with mean 0 and variance 1/m. Consider now the reduced singular value decomposition of X

    $$X = U\Sigma V^*, \qquad U \in \mathbb{R}^{m \times r} \text{ and } \Sigma, V \in \mathbb{R}^{r \times r}.$$

Then the columns of U are r orthonormal vectors selected uniformly at random and thus U and Q
have the same distribution. Thus we can think of Q as being the left singular vectors of a Gaussian
matrix X with independent entries. From now on, we identify U with Q. Observe now that

    $$\|X^*(\hat e - e)\|_{\ell_2} = \|V\Sigma Q^*(\hat e - e)\|_{\ell_2} = \|\Sigma Q^*(\hat e - e)\|_{\ell_2} \le \sigma_1(X)\,\|Q^*(\hat e - e)\|_{\ell_2},$$

where $\sigma_1(X)$ is the largest singular value of X. The singular values of Gaussian matrices are well
concentrated and a classical result [10] shows that

    $$P\left(\sigma_1(X) > 1 + \sqrt{r/m} + t\right) \le e^{-mt^2/2}. \tag{5.9}$$

By choosing t = 1 in the above formula, we have

    $$\|X^*(\hat e - e)\|_{\ell_2} \le 3\,\|Q^*(\hat e - e)\|_{\ell_2} \le 6\varepsilon$$

with probability at least $1 - e^{-m/2}$ since $\|Q^*(\hat e - e)\|_{\ell_2} \le 2\varepsilon$. We now apply Lemma 5.3 with
$\Phi = X^*$, which gives

    $$\|\hat e - e\|_{\ell_2} \le \frac{3\sqrt{6}\,\varepsilon}{a_{3k}(X^*) - \frac{1}{\sqrt{2}}b_{2k}(X^*)} = \sqrt{\frac{m}{r}} \cdot \frac{3\sqrt{6}\,\varepsilon}{a_{3k}(Y^*) - \frac{1}{\sqrt{2}}b_{2k}(Y^*)},$$

where $Y = \sqrt{m/r}\,X$. The theorem is proved since it is well known that if $k \le c_0 \cdot r/\log(m/r)$ for
some constant $c_0$, we have $a_{3k}(Y^*) - \frac{1}{\sqrt{2}}b_{2k}(Y^*) \ge c_1$ with probability at least $1 - O(e^{-\gamma' r})$ for
some universal constants $c_1$ and $\gamma'$; this follows from available bounds on the restricted isometry
constants of Gaussian matrices [6, 8, 12, 21].



5.2.2 Proof of Corollary 2.3 

First, we can just assume that $\sigma = 1$ as the general case is treated by a simple rescaling. Put $r = m - n$. Since the random vector $z$ follows a multivariate normal distribution with mean zero and covariance matrix $I_m$ ($I_m$ is the identity matrix in $m$ dimensions), $Q^* z$ is also multivariate normal with mean zero and covariance matrix $Q^* Q = I_r$. Consequently, $\|Q^* z\|_{\ell_2}^2$ is distributed as a chi-squared variable with $r$ degrees of freedom. Pick $\lambda = \gamma \sqrt{r}$ in (5.1), and obtain

$$ P\left(\|Q^* z\|_{\ell_2}^2 \ge (1 + \gamma\sqrt{2} + \gamma^2)\, r\right) \le e^{-\gamma^2 r/2}. $$

With $t = \gamma\sqrt{2} + \gamma^2$ so that $\gamma = (\sqrt{1 + 2t} - 1)/\sqrt{2}$, we have $\|Q^* z\|_{\ell_2} \le \sqrt{r(1+t)}$ with probability at least $1 - e^{-\gamma^2 (m-n)/2}$. On this event, Theorem 2.2 asserts that

$$ \|x - \hat x\|_{\ell_2} \le C \sqrt{m(1+t)} + \|x - x^{\mathrm{Ideal}}\|_{\ell_2}. $$

This essentially concludes the proof of the corollary since the size of $\|x - x^{\mathrm{Ideal}}\|_{\ell_2}$ is about $\sqrt{n}$. Indeed, $\|x - x^{\mathrm{Ideal}}\|_{\ell_2}^2 = \|A^* z\|_{\ell_2}^2 \sim \chi^2_n$ as observed earlier. As a consequence, for each $t_0 > 0$, we have $\|x - x^{\mathrm{Ideal}}\|_{\ell_2} \le \sqrt{n(1 + t_0)} \cdot \sigma$ with probability at least $1 - e^{-\gamma_0^2 n/2}$, where $\gamma_0$ is the same function of $t_0$ as before. Selecting $t_0$ as $t_0 = m/n$, say, gives the result.
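
The change of variables between $t$ and $\gamma$ and the resulting chi-squared tail bound can be checked numerically. The sketch below, under the assumption that the bound reads $P(\chi^2_r \ge (1+t) r) \le e^{-\gamma^2 r/2}$ with $\gamma = (\sqrt{1+2t} - 1)/\sqrt{2}$, compares it with a Monte Carlo estimate; the function names, the sample size and the seed are hypothetical choices made for illustration.

```python
import math
import random

def gamma_of_t(t):
    """Solve t = gamma*sqrt(2) + gamma**2 for the nonnegative root gamma."""
    return (math.sqrt(1.0 + 2.0 * t) - 1.0) / math.sqrt(2.0)

def chi2_tail_estimate(r, t, trials, rng):
    """Monte Carlo estimate of P(chi^2_r >= (1 + t) * r)."""
    hits = 0
    for _ in range(trials):
        s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(r))
        if s >= (1.0 + t) * r:
            hits += 1
    return hits / trials

rng = random.Random(1)
r, t = 20, 1.0
g = gamma_of_t(t)
bound = math.exp(-g * g * r / 2.0)          # e^{-gamma^2 r / 2}
estimate = chi2_tail_estimate(r, t, 20000, rng)
print("gamma:", g, "bound:", bound, "empirical tail:", estimate)
```

The empirical tail is typically much smaller than the bound, which is consistent with the remark that no attempt was made to optimize constants.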



5.3 The LP decoder 

Before we begin, we introduce the number $\theta_{k,k'}$ of a matrix $\Phi \in \mathbb{R}^{r \times m}$ for $k + k' \le m$, called the $k, k'$-restricted orthogonality constant. This is the smallest quantity such that

$$ |\langle \Phi v, \Phi v' \rangle| \le \theta_{k,k'} \cdot \|v\|_{\ell_2} \|v'\|_{\ell_2} \tag{5.11} $$

holds for all $k$ and $k'$-sparse vectors supported on disjoint sets. Small values of restricted orthogonality constants indicate that disjoint subsets of columns span nearly orthogonal subspaces. The following lemma, which relates the number $\theta_{k,k'}$ to the extremal singular values, will prove useful.



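Lemma 5.4 For any matrix $\Phi \in \mathbb{R}^{r \times m}$, we have

$$ \theta_{k,k'}(\Phi) \le \tfrac{1}{2}\left(b_{k+k'}^2(\Phi) - a_{k+k'}^2(\Phi)\right). $$

Proof Consider two unit-normed vectors $v$ and $v'$, supported on disjoint sets, which are respectively $k$ and $k'$-sparse. Since $\|v \pm v'\|_{\ell_2}^2 = 2$, by definition we have

$$ 2 a_{k+k'}^2(\Phi) \le \|\Phi v + \Phi v'\|_{\ell_2}^2 \le 2 b_{k+k'}^2(\Phi), $$
$$ 2 a_{k+k'}^2(\Phi) \le \|\Phi v - \Phi v'\|_{\ell_2}^2 \le 2 b_{k+k'}^2(\Phi), $$

and the conclusion follows from the parallelogram identity

$$ |\langle \Phi v, \Phi v' \rangle| = \tfrac{1}{4} \left| \|\Phi v + \Phi v'\|_{\ell_2}^2 - \|\Phi v - \Phi v'\|_{\ell_2}^2 \right| \le \tfrac{1}{2}\left(b_{k+k'}^2(\Phi) - a_{k+k'}^2(\Phi)\right). $$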
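
The bound relating the restricted orthogonality constant to the extremal singular values can be verified by brute force in the simplest case $k = k' = 1$. The sketch below assumes, as in the proof, that $a_k$ and $b_k$ denote the smallest and largest singular values of $\Phi$ over $k$-sparse inputs, so that $a_2^2$ and $b_2^2$ are the extreme eigenvalues of the $2 \times 2$ Gram matrices of all pairs of columns; the matrix size, the seed and the helper names are illustrative assumptions.

```python
import math
import random
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lemma_54_quantities(columns):
    """Return (theta_11, a_2^2, b_2^2) for a matrix given by its list of columns."""
    theta, a2_sq, b2_sq = 0.0, float("inf"), 0.0
    for u, v in combinations(columns, 2):
        p, q, s = dot(u, u), dot(u, v), dot(v, v)
        # Eigenvalues of the 2x2 Gram matrix [[p, q], [q, s]] in closed form.
        mean = 0.5 * (p + s)
        rad = math.sqrt((0.5 * (p - s)) ** 2 + q * q)
        theta = max(theta, abs(q))      # |<Phi e_i, Phi e_j>| for i != j
        a2_sq = min(a2_sq, mean - rad)
        b2_sq = max(b2_sq, mean + rad)
    return theta, a2_sq, b2_sq

rng = random.Random(0)
r, m = 6, 10
columns = [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)] for _ in range(m)]
theta, a2_sq, b2_sq = lemma_54_quantities(columns)
print("theta_{1,1}:", theta, "(b_2^2 - a_2^2)/2:", 0.5 * (b2_sq - a2_sq))
```

For larger sparsity levels the same check requires the singular values of all submatrices with $k + k'$ columns, which quickly becomes expensive; the case above is only meant to make the inequality concrete.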



The argument underlying Theorem 3.1 uses an intermediate result whose proof may be found in the Appendix. Here and in the remainder of this paper, $x_I$ is the restriction of the vector $x$ to an index set $I$, and for a matrix $X$, $X_I$ is the submatrix formed by selecting the columns of $X$ with indices in $I$.

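Lemma 5.5 Let $\Phi$ be an $r \times m$-dimensional matrix and suppose $T_0$ is a set of cardinality $k$. For a vector $h \in \mathbb{R}^m$, we let $T_1$ be the $k'$ largest positions of $h$ outside of $T_0$. Put $T_{01} = T_0 \cup T_1$. Then

$$ \|h_{T_{01}}\|_{\ell_2} \le \frac{1}{a_{k+k'}^2(\Phi)}\, \|\Phi_{T_{01}}^* \Phi h\|_{\ell_2} + \frac{\theta_{k+k',k'}(\Phi)}{a_{k+k'}^2(\Phi)\sqrt{k'}}\, \|h_{T_0^c}\|_{\ell_1} $$

and

$$ \|h\|_{\ell_2}^2 \le \|h_{T_{01}}\|_{\ell_2}^2 + \frac{1}{k'}\, \|h_{T_0^c}\|_{\ell_1}^2. \tag{5.13} $$

5.3.1 Proof of Theorem 3.1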

Just as before, it suffices to show that $\|\hat e - e\|_{\ell_2} \le C \sqrt{k} \cdot \lambda \cdot (1 - n/m)^{-1}$. Set $h = \hat e - e$ and let $T_0$ be the support of $e$ (which has size $k$). Because $e$ is feasible for $(P_\infty)$ we have on the one hand $\|\hat e\|_{\ell_1} \le \|e\|_{\ell_1}$, which gives

$$ \|e_{T_0}\|_{\ell_1} - \|h_{T_0}\|_{\ell_1} + \|h_{T_0^c}\|_{\ell_1} \le \|e + h\|_{\ell_1} \le \|e\|_{\ell_1} \implies \|h_{T_0^c}\|_{\ell_1} \le \|h_{T_0}\|_{\ell_1}. $$

Note that this has an interesting consequence since

$$ \|h_{T_0^c}\|_{\ell_1} \le \|h_{T_0}\|_{\ell_1} \le \sqrt{k} \cdot \|h_{T_0}\|_{\ell_2} \tag{5.14} $$

by Cauchy-Schwarz. On the other hand

$$ \|QQ^* h\|_{\ell_\infty} \le \|QQ^*(\hat e - y)\|_{\ell_\infty} + \|QQ^*(y - e)\|_{\ell_\infty} \le 2\lambda. \tag{5.15} $$

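The ingredients are now in place to establish the claim. We set $k' = k$, apply Lemma 5.5 with $\Phi = Q^*$ to the vector $h = \hat e - e$, and obtain

$$ \|h\|_{\ell_2} \le \sqrt{2}\, \|h_{T_{01}}\|_{\ell_2}, \quad \text{and} \quad \|h_{T_{01}}\|_{\ell_2} \le \frac{1}{a_{2k}^2(Q^*) - \theta_{2k,k}(Q^*)}\, \|Q_{T_{01}} Q^* h\|_{\ell_2}. \tag{5.16} $$

Since each component of $Q_{T_{01}} Q^* h$ is at most equal to $2\lambda$, see (5.15), we have $\|Q_{T_{01}} Q^* h\|_{\ell_2} \le \sqrt{2k} \cdot 2\lambda$. We then conclude from Lemma 5.4 that

$$ \|h\|_{\ell_2} \le \frac{4\sqrt{k} \cdot \lambda}{a_{2k}^2(Q^*) + \frac{1}{2} a_{3k}^2(Q^*) - \frac{1}{2} b_{3k}^2(Q^*)}. \tag{5.17} $$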

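For each $k$, recall the relations $a_k^2(Q^*) = 1 - b_k^2(A^*)$ and $b_k^2(Q^*) = 1 - a_k^2(A^*)$ which give

$$ \|h\|_{\ell_2} \le \frac{4\sqrt{k} \cdot \lambda}{D}, \quad D := 1 - b_{2k}^2(A^*) - \tfrac{1}{2} b_{3k}^2(A^*) + \tfrac{1}{2} a_{3k}^2(A^*). $$

Now just as before, it follows from our definitions that for each $k$, $b_k^2(A^*) \le \frac{n}{m}(1 + \delta_k)$ and $a_k^2(A^*) \ge \frac{n}{m}(1 - \delta_k)$. These inequalities imply

$$ D \ge 1 - \frac{n}{m}\left(1 + \delta_{2k} + \delta_{3k}\right). $$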

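Therefore, if one assumes that

$$ \delta_{2k} + \delta_{3k} \le c \left(\frac{m}{n} - 1\right), $$

for some fixed constant $0 < c < 1$, then $D \ge (1 - c)(1 - n/m)$ and

$$ \|\hat e - e\|_{\ell_2} = \|h\|_{\ell_2} \le \frac{4}{1 - c} \cdot \frac{\sqrt{k} \cdot \lambda}{1 - \frac{n}{m}}. $$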



This establishes the first part of the theorem. 

We turn to the second part of the claim; if the orthonormal columns of A are chosen uniformly 
at random, we show that the error bound (3.5) is valid with large probability as long as we have 
a constant fraction of gross errors. The same argument as before (albeit for a general value of k') 
gives 

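$$ \|h\|_{\ell_2} \le \frac{2\sqrt{1 + k/k'}\,\sqrt{k + k'} \cdot \lambda}{D}, \quad D := a_{k+k'}^2(Q^*) + \frac{1}{2}\sqrt{\frac{k}{k'}} \left(a_{k+2k'}^2(Q^*) - b_{k+2k'}^2(Q^*)\right), \tag{5.18} $$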

so that this is really a question about the extremal singular values of random orthogonal projections 
when restricted to sparse inputs. 

Put $r = m - n$ and let $X$ be an $m$ by $r$ matrix with independent Gaussian entries with mean 0 and variance $1/m$. Recall the QR factorization of $X$,

$$ X = Q'R, \quad Q' \in \mathbb{R}^{m \times r}, \quad R \in \mathbb{R}^{r \times r}, $$

where $R$ is upper triangular. The columns of $Q'$ are $r$ orthonormal vectors selected uniformly at random and thus, $Q$ and $Q'$ have the same distribution so that we can think of $Q$ as being the Q-factor in the QR factorization of $X$. Also observe that $\sigma_j(X) = \sigma_j(R)$, $1 \le j \le r$, i.e. the nonzero singular values of $R$ and $X$ coincide. It follows from

$$ v^* X X^* v = v^* Q R R^* Q^* v $$

(which is valid for all $v \in \mathbb{R}^m$) that

$$ \sigma_r(X)\, \|Q^* v\|_{\ell_2} \le \|X^* v\|_{\ell_2} \le \sigma_1(X)\, \|Q^* v\|_{\ell_2}. $$

Applying the above inequalities to $k$-sparse vectors gives

$$ b_k(Q^*) \le \frac{1}{\sigma_r(X)}\, b_k(X^*), \qquad a_k(Q^*) \ge \frac{1}{\sigma_1(X)}\, a_k(X^*). $$

The point is that the extremal singular values $(a_k(X^*), b_k(X^*))_{1 \le k \le r}$ are perhaps easier to study than those of $Q^*$.

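Indeed, classical results from random matrix theory [10, 18] assert that for each $t > 0$,

$$ P\left(\sigma_1(X^*) > 1 + (1 + t)\sqrt{r/m}\right) \le e^{-r t^2/2}, \tag{5.19} $$

$$ P\left(\sigma_r(X^*) < 1 - (1 + t)\sqrt{r/m}\right) \le e^{-r t^2/2}. \tag{5.20} $$

These inequalities can be specialized to $r \times k$ submatrices of $X$, and taking the union bound also shows that for each $t > 0$,

$$ P\left(\sqrt{\frac{m}{r}}\; b_k(X^*) > 1 + \sqrt{\frac{k}{r}} + t\right) \le \binom{m}{k} e^{-r t^2/2}, \tag{5.21} $$

$$ P\left(\sqrt{\frac{m}{r}}\; a_k(X^*) < 1 - \sqrt{\frac{k}{r}} - t\right) \le \binom{m}{k} e^{-r t^2/2}. \tag{5.22} $$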

We use these estimates to bound below the denominator $D$ in (5.18). We first study the case where $r \le m/16$ and in the sequel, we will denote $\sqrt{k/k'}$ by $\rho$. First, pick $t = 1/3$ in (5.19)-(5.20). Then the event

$$ E := \{2/3 \le \sigma_r(X) \le \sigma_1(X) \le 4/3\} $$

has probability at least $1 - 2e^{-r/18}$. On this event, the denominator $D$ in (5.18) obeys

$$ D \ge \left(\frac{3}{4}\right)^2 a_{k+k'}^2(X^*) - \left(\frac{3}{2\sqrt{2}}\right)^2 \rho\, b_{k+2k'}^2(X^*) = \frac{9}{16}\left(a_{k+k'}^2(X^*) - 2\rho\, b_{k+2k'}^2(X^*)\right). $$

Second, selecting $t = 1/8$ in (5.21) and (5.22) shows that the events $E_1$ and $E_2$ respectively equal to

$$ \sqrt{\frac{m}{r}}\, a_{k+k'}(X^*) \ge \frac{7}{8} - \sqrt{\frac{k + k'}{r}} \quad \text{and} \quad \sqrt{\frac{m}{r}}\, b_{k+2k'}(X^*) \le \frac{9}{8} + \sqrt{\frac{k + 2k'}{r}} $$

have probability at least $1 - \binom{m}{k+k'} e^{-r/128}$ and $1 - \binom{m}{k+2k'} e^{-r/128}$. Third, select $k'$ to be the smallest integer so that $\rho = \sqrt{k/k'} \le 1/8$. Combining these facts gives



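$$ P\left(D \ge \frac{9r}{16m}\left[\left(\frac{7}{8} - \sqrt{\frac{k + k'}{r}}\right)^2 - \frac{1}{4}\left(\frac{9}{8} + \sqrt{\frac{k + 2k'}{r}}\right)^2\right]\right) \ge 1 - \eta, $$

where $\eta = 2\binom{m}{k+2k'} e^{-r/128} + 2e^{-r/18}$. Elementary calculations show that

$$ \left(\frac{7}{8} - \sqrt{\frac{k + k'}{r}}\right)^2 - \frac{1}{4}\left(\frac{9}{8} + \sqrt{\frac{k + 2k'}{r}}\right)^2 \ge \frac{1}{16} \quad \text{if} \quad \sqrt{\frac{k + k'}{r}} + \frac{1}{2}\sqrt{\frac{k + 2k'}{r}} \le \frac{1}{4}, $$

and

$$ \frac{7}{8} - \sqrt{\frac{k + k'}{r}} \ge 0 $$

under the same condition. In summary, $D \ge 9r/(256m)$ with probability at least $1 - \eta$ provided that

$$ \sqrt{\frac{k + k'}{r}} + \frac{1}{2}\sqrt{\frac{k + 2k'}{r}} \le \sqrt{\frac{k}{r}}\left(\sqrt{66} + \frac{1}{2}\sqrt{131}\right) \le \frac{1}{4} $$

since $k'/k \le 65$. Finally $k + 2k' \le 131k$ and assuming $131k \le m/2$, we also have

$$ \log\binom{m}{k + 2k'} \le \log\binom{m}{131k} \le 131k\left(1 + \log\frac{m}{131k}\right). \tag{5.23} $$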
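
The last inequality in (5.23) is an instance of the standard estimate $\log\binom{m}{j} \le j(1 + \log(m/j))$. A minimal sketch checking it for $j = 131k$ is given below; the value of $m$, the tested values of $k$ and the function names are assumptions chosen only for illustration.

```python
import math

def log_binom(m, j):
    """Natural logarithm of the binomial coefficient C(m, j)."""
    return math.log(math.comb(m, j))

def log_binom_bound(m, j):
    """Upper bound j * (1 + log(m / j)) on log C(m, j), for 1 <= j <= m."""
    return j * (1.0 + math.log(m / j))

m = 100000
ks = (1, 2, 5, 10, 100)
for k in ks:
    j = 131 * k
    assert 2 * j <= m          # the assumption 131 k <= m / 2
    print(k, log_binom(m, j), log_binom_bound(m, j))
```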



In other words, $D \ge 9r/(256m)$ with probability at least $1 - 2e^{-r/256} - 2e^{-r/18}$ as long as the right-hand side of (5.23) is less than or equal to $r/256$. It follows from (5.18) that with at least the same probability, $\|\hat e - e\|_{\ell_2} \le C_1 \cdot (m/r) \cdot \sqrt{k} \cdot \lambda$ and hence, one can correct a constant fraction of errors (the fraction depends on $n/m$ of course) in the case where $r \le m/16$.
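
The elementary calculation behind the constant $9r/(256m)$ can be confirmed on a grid. Writing $s_1 = \sqrt{(k+k')/r}$ and $s_2 = \sqrt{(k+2k')/r}$ (names introduced here for illustration), the sketch below checks that $(7/8 - s_1)^2 - \frac{1}{4}(9/8 + s_2)^2 \ge 1/16$ whenever $s_1 + s_2/2 \le 1/4$; the grid resolution is an arbitrary assumption.

```python
def gap(s1, s2):
    """(7/8 - s1)^2 - (1/4) * (9/8 + s2)^2."""
    return (7.0 / 8.0 - s1) ** 2 - 0.25 * (9.0 / 8.0 + s2) ** 2

def worst_gap(steps=200):
    """Minimum of gap over a grid of s1, s2 >= 0 with s1 + s2/2 <= 1/4."""
    worst = float("inf")
    for i in range(steps + 1):
        s1 = 0.25 * i / steps
        for j in range(steps + 1):
            s2 = 0.5 * j / steps
            if s1 + 0.5 * s2 <= 0.25:
                worst = min(worst, gap(s1, s2))
    return worst

w = worst_gap()
print("smallest value on the grid:", w, "claimed lower bound:", 1.0 / 16.0)
print("resulting constant:", (9.0 / 16.0) * (1.0 / 16.0), "= 9/256")
```

The factorization $(u - v)(u + v)$ with $u = 7/8 - s_1$ and $v = 9/16 + s_2/2$ gives the same conclusion by hand, since $u - v \ge 1/16$ and $u + v \ge 1$ on the admissible region.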

It remains to argue that the result is valid when $r > m/16$. Here, the denominator $D$ in (5.18) obeys

$$ D \ge b_{k+2k'}^2(Q^*) - \rho/2 \ge b_1^2(Q^*) - \rho/2. $$

Let $Z_r$ be the squared $\ell_2$ norm of the first column of $Q^*$, i.e. $Z_r = (QQ^*)_{1,1}$. We have $b_1^2(Q^*) \ge Z_r$ and moreover, Lemma 5.2 proved that

$$ P\left(Z_r < 3r/(4m)\right) \le 2e^{-\gamma_0 r/2} $$

for some constant $\gamma_0 > 0$. Pick $k'$ to be the smallest integer so that $\rho = \sqrt{k/k'} \le r/m$. Then $D \ge r/(4m)$, and we conclude that $\|\hat e - e\|_{\ell_2} \le C \cdot (m/r) \cdot \sqrt{k} \cdot \lambda$ with probability at least $1 - 2e^{-\gamma_0 r/2}$. To be complete, the condition on $k$ is $k + 2k' \le r$ which is satisfied if $k(3 + (m/r)^2) \le r$ or equivalently $k \le \rho_0 \cdot m$ with $\rho_0 = r/m \cdot (1 + (m/r)^2)^{-1}$.

5.3.2 Proof of Corollary 3.2



First, we can just assume that $\sigma = 1$ as the general case is treated by a simple rescaling. The random vector $QQ^* z$ follows a multivariate normal distribution with mean zero and covariance matrix $QQ^*$. In particular $(QQ^* z)_i \sim N(0, s_i^2)$, where $s_i^2 = (QQ^*)_{i,i}$. This implies that $\tilde z_i = (QQ^* z)_i / s_i$ is standard normal with density $\phi(t) = (2\pi)^{-1/2} e^{-t^2/2}$. For each $i$, $P(|\tilde z_i| > t) \le 2\phi(t)/t$ and thus

$$ P\left(\sup_{1 \le i \le m} |\tilde z_i| > t\right) \le 2m \cdot \phi(t)/t. $$

With $t = \sqrt{2\log m}$, this gives $P(\sup_{1 \le i \le m} |\tilde z_i| > \sqrt{2\log m}) \le 1/\sqrt{\pi \log m}$. Better bounds are possible but we will not pursue these refinements here. Observe now that $s_i^2 = \|Q_{i,\cdot}\|_{\ell_2}^2 = 1 - \|A_{i,\cdot}\|_{\ell_2}^2$, and since $\lambda_i = \sqrt{2\log m}\, \|Q_{i,\cdot}\|_{\ell_2}$, we have that

$$ |(QQ^* z)_i| \le \lambda_i, \quad \forall i \tag{5.24} $$

with probability at least $1 - 1/\sqrt{\pi \log m}$.
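
The two facts used above, namely the Gaussian tail bound $P(|Z| > t) \le 2\phi(t)/t$ and the simplification of $2m \cdot \phi(t)/t$ at $t = \sqrt{2\log m}$, can be checked directly. The sketch below uses the exact two-sided tail $\mathrm{erfc}(t/\sqrt{2})$ of a standard normal variable; the function names and the tested values of $m$ are illustrative assumptions.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def two_sided_tail(t):
    """Exact P(|Z| > t) for a standard normal Z."""
    return math.erfc(t / math.sqrt(2.0))

def union_bound(m):
    """2 m phi(t) / t evaluated at t = sqrt(2 log m)."""
    t = math.sqrt(2.0 * math.log(m))
    return 2.0 * m * phi(t) / t

ms = (10, 1000, 100000)
for m in ms:
    t = math.sqrt(2.0 * math.log(m))
    closed_form = 1.0 / math.sqrt(math.pi * math.log(m))
    print(m, two_sided_tail(t), 2.0 * phi(t) / t, union_bound(m), closed_form)
```

The printed values show that the failure probability $1/\sqrt{\pi \log m}$ decays only logarithmically in $m$, in line with the remark that better bounds are possible.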

On the event (5.24), Theorem 3.1 then shows that

$$ \|x - \hat x\|_{\ell_2} \le C \sqrt{k} \cdot (m/r) \cdot \max_i |\lambda_i| + \|x - x^{\mathrm{Ideal}}\|_{\ell_2}. \tag{5.25} $$

We claim that

$$ \frac{\max_i |\lambda_i|}{\sqrt{2\log m}} = \max_i \|Q_{i,\cdot}\|_{\ell_2} \le \sqrt{\frac{3r}{m}} \tag{5.26} $$

with probability at least $1 - 2e^{-\gamma m}$ for some positive constant $\gamma$. Combining (5.25) and (5.26) yields

$$ \|x - \hat x\|_{\ell_2} \le \sqrt{6}\, C \cdot \sqrt{\frac{m \log m}{m - n}} \cdot \sqrt{k} + \|x - x^{\mathrm{Ideal}}\|_{\ell_2}. $$

This would essentially conclude the proof of the corollary since the size of $\|x - x^{\mathrm{Ideal}}\|_{\ell_2}$ is about $\sqrt{n}$. Exact bounds for $\|x - x^{\mathrm{Ideal}}\|_{\ell_2}$ are found in the proof of Corollary 2.3 and we do not repeat the argument.

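It remains to check why (5.26) is true. For $r \ge m/3$ and since $\|Q_{i,\cdot}\|_{\ell_2} \le 1$, the claim holds with probability 1 because $3r/m \ge 1$. For $r < m/3$, it follows from $\|Q_{i,\cdot}\|_{\ell_2}^2 + \|A_{i,\cdot}\|_{\ell_2}^2 = 1$ that

$$ P\left(\max_i \|Q_{i,\cdot}\|_{\ell_2}^2 > \frac{3r}{m}\right) = P\left(\min_i \|A_{i,\cdot}\|_{\ell_2}^2 < \frac{n}{m}\left(1 - \frac{2r}{n}\right)\right) \le m\, P\left(\|A_{1,\cdot}\|_{\ell_2}^2 < \frac{n}{m}\left(1 - \frac{2r}{n}\right)\right). $$

The claim follows by applying Lemma 5.2 since $r/n < 1/2$.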
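
The row identity $\|Q_{i,\cdot}\|_{\ell_2}^2 + \|A_{i,\cdot}\|_{\ell_2}^2 = 1$ simply expresses that the columns of $A$ and $Q$ together form an orthonormal basis of $\mathbb{R}^m$. The following sketch builds such a basis by Gram-Schmidt orthogonalization of Gaussian vectors and verifies the identity together with the rewriting $1 - 3r/m = \frac{n}{m}(1 - 2r/n)$; the dimensions, the seed and the helper name `random_orthonormal_basis` are assumptions made for illustration.

```python
import math
import random

def random_orthonormal_basis(m, rng):
    """Gram-Schmidt on Gaussian vectors; returns m orthonormal vectors of R^m."""
    basis = []
    while len(basis) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _ in range(2):                       # second pass improves stability
            for b in basis:
                c = sum(x * y for x, y in zip(v, b))
                v = [x - c * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:
            basis.append([x / norm for x in v])
    return basis

rng = random.Random(2)
m, n = 12, 9
r = m - n
basis = random_orthonormal_basis(m, rng)
A_cols, Q_cols = basis[:n], basis[n:]            # columns of A and of Q

# Squared l2 norms of the i-th rows of A and of Q.
row_A = [sum(col[i] ** 2 for col in A_cols) for i in range(m)]
row_Q = [sum(col[i] ** 2 for col in Q_cols) for i in range(m)]
print(max(abs(a + q - 1.0) for a, q in zip(row_A, row_Q)))
print(1.0 - 3.0 * r / m, (n / m) * (1.0 - 2.0 * r / n))
```

In particular $\min_i \|Q_{i,\cdot}\|_{\ell_2}^2 = 1 - \max_i \|A_{i,\cdot}\|_{\ell_2}^2$, which is the case $k = 1$ of the relation $a_k^2(Q^*) = 1 - b_k^2(A^*)$ used in the proof of Theorem 3.1.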



6 Discussion 

We have introduced two decoding strategies for recovering a block $x \in \mathbb{R}^n$ of $n$ pieces of information from a codeword $Ax$ which has been corrupted both by adversarial and small errors. Our methods are concrete, efficient and guaranteed to perform well. Because we are working with real valued inputs, we emphasize that this work has nothing to do with the use of linear programming methods proposed by Feldman and his colleagues to decode binary codes such as turbo-codes or low-density parity check codes [13-15]. Instead, it has much to do with the recent literature on compressive sampling or compressed sensing [3, 8, 9, 11, 20, 22], see also [19, 23] for related work.

On the practical end, we truly recommend using the two-step refinement discussed in Section 4 — 
the reprojection step — as this really tends to enhance the performance. We anticipate that other 
tweaks of this kind might also work and provide additional enhancement. On the theoretical end, 
we have not tried to obtain the best possible constants and there is little doubt that a more careful 
analysis will provide sharper constants. Also, we presented some results for coding matrices with 
orthonormal columns for ease of exposition but this is unessential. In fact, our results can be 
extended to nonorthogonal matrices. For instance, one could just as well obtain similar results for 
m x n coding matrices A with independent Gaussian entries. 

There are also variations on how one might want to decode. We focused on constraints of the form $\|P_{V^\perp} \tilde z\|$ where $\|\cdot\|$ is either the $\ell_2$ norm or the $\ell_\infty$ norm, and $P_{V^\perp}$ is the orthoprojector onto $V^\perp$, the orthogonal subspace to the column space of $A$. But one could also imagine choosing other types of constraints, e.g. of the form $\|X^* \tilde z\|_{\ell_2} \le \epsilon$ for $(P_2)$ or $\|XX^* \tilde z\|_{\ell_\infty} \le \lambda$ for $(P_\infty)$ (or constraints about the individual magnitudes of the coordinates $(XX^* \tilde z)_i$ in the more general formulation), where the columns of $X$ span $V^\perp$. In fact, one could choose the decoding matrix $X$ first, and then $A$ so that the ranges of $A$ and $X$ are orthogonal. Choosing $X \in \mathbb{R}^{m \times r}$ with i.i.d. mean-zero Gaussian entries and applying the LP decoder with a constraint on $\|XX^* \tilde z\|_{\ell_\infty}$ instead of $\|QQ^* \tilde z\|_{\ell_\infty}$ would simplify the argument since restricted isometry constants for Gaussian matrices are already readily available [6, 8, 12, 21].


Finally, we discussed the use of coding matrices which have fast algorithms, thus enabling large 
scale problems. Exploring further opportunities in this area seems a worthy pursuit. 

7 Appendix: Proof of Lemma 5.5 

The proof is adapted from that of Lemma 3.1 in [7]. In the sequel, $T_0 \subset \{1, \ldots, m\}$ is a set of size $k$, $T_1$ is the $k'$ largest positions of $h$ outside of $T_0$, $T_{01} = T_0 \cup T_1$ and $V_{01} \subset \mathbb{R}^r$ is the subspace spanned by the columns of $\Phi$ with indices in $T_{01}$. Below, we omit the dependence on $\Phi$ in the constants $a_k(\Phi)$ and $\theta_{k,k'}(\Phi)$.

Let $P_{V_{01}}$ be the orthogonal projection onto $V_{01}$. For each $w \in \mathbb{R}^r$, $\Phi_{T_{01}}^* w = \Phi_{T_{01}}^* P_{V_{01}} w$ and thus,

$$ \|\Phi_{T_{01}}^* w\|_{\ell_2} \ge a_{k+k'}\, \|P_{V_{01}} w\|_{\ell_2}, \tag{7.1} $$

since all the singular values of $\Phi_{T_{01}}$ are lower bounded by $a_{|T_{01}|} = a_{k+k'}$. With $w = \Phi h$, this gives

$$ \|P_{V_{01}} \Phi h\|_{\ell_2} \le a_{k+k'}^{-1}\, \|\Phi_{T_{01}}^* \Phi h\|_{\ell_2}. \tag{7.2} $$

Next, divide $T_0^c$ into subsets of size $k'$ and enumerate $T_0^c$ as $n_1, n_2, \ldots, n_{m - |T_0|}$ in decreasing order of magnitude of $h_{T_0^c}$. Set $T_j = \{n_\ell, (j - 1)k' + 1 \le \ell \le jk'\}$. That is, $T_1$ is as before and contains the indices of the $k'$ largest coefficients of $h_{T_0^c}$, $T_2$ contains the indices of the next $k'$ largest coefficients, and so on. We will develop a lower bound on the $\ell_2$ norm of $P_{V_{01}} \Phi h$, which we decompose as

$$ P_{V_{01}} \Phi h = P_{V_{01}} \Phi h_{T_{01}} + \sum_{j \ge 2} P_{V_{01}} \Phi h_{T_j} = \Phi h_{T_{01}} + \sum_{j \ge 2} P_{V_{01}} \Phi h_{T_j}. \tag{7.3} $$

By definition $P_{V_{01}} \Phi h_{T_j} \in V_{01}$ and thus

$$ P_{V_{01}} \Phi h_{T_j} = \Phi_{T_{01}} c \implies a_{k+k'}\, \|c\|_{\ell_2} \le \|P_{V_{01}} \Phi h_{T_j}\|_{\ell_2}, $$

which again follows from the lower bound on the singular values of $\Phi_{T_{01}}$ (the coefficient sequence $c$ depends on $j$). Observe now that

$$ \|P_{V_{01}} \Phi h_{T_j}\|_{\ell_2}^2 = \langle P_{V_{01}} \Phi h_{T_j}, \Phi h_{T_j} \rangle = \langle \Phi_{T_{01}} c, \Phi h_{T_j} \rangle \le \theta_{k+k',k'}\, \|c\|_{\ell_2} \|h_{T_j}\|_{\ell_2} \le \frac{\theta_{k+k',k'}}{a_{k+k'}}\, \|P_{V_{01}} \Phi h_{T_j}\|_{\ell_2} \|h_{T_j}\|_{\ell_2}, $$

where the first inequality follows from the definition (5.11) of the number $\theta_{k+k',k'}$. In short,

$$ \|P_{V_{01}} \Phi h_{T_j}\|_{\ell_2} \le \frac{\theta_{k+k',k'}}{a_{k+k'}}\, \|h_{T_j}\|_{\ell_2}. \tag{7.4} $$

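We then develop an upper bound on $\sum_{j \ge 2} \|h_{T_j}\|_{\ell_2}$ as in [4]. By construction, the magnitude of each coefficient in $T_{j+1}$ is less than the average of the magnitudes in $T_j$,

$$ \|h_{T_{j+1}}\|_{\ell_\infty} \le \|h_{T_j}\|_{\ell_1}/k' \implies \|h_{T_{j+1}}\|_{\ell_2}^2 \le \|h_{T_j}\|_{\ell_1}^2/k'. $$

Therefore,

$$ \sum_{j \ge 2} \|h_{T_j}\|_{\ell_2} \le \sum_{j \ge 1} \|h_{T_j}\|_{\ell_1}/\sqrt{k'} = \|h\|_{\ell_1(T_0^c)}/\sqrt{k'}. \tag{7.5} $$

Hence, we deduce from (7.3) that

$$ \|P_{V_{01}} \Phi h\|_{\ell_2} \ge \|\Phi h_{T_{01}}\|_{\ell_2} - \sum_{j \ge 2} \|P_{V_{01}} \Phi h_{T_j}\|_{\ell_2} \ge a_{k+k'}\, \|h_{T_{01}}\|_{\ell_2} - \frac{\theta_{k+k',k'}}{a_{k+k'}} \sum_{j \ge 2} \|h_{T_j}\|_{\ell_2} \ge a_{k+k'}\, \|h_{T_{01}}\|_{\ell_2} - \frac{\theta_{k+k',k'}}{a_{k+k'}\sqrt{k'}}\, \|h\|_{\ell_1(T_0^c)}. $$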

Combining this with (7.2) proves the first part of the lemma. 

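For the second part, observe that the $j$th largest value of $|h_{T_0^c}|$ obeys $|h_{T_0^c}|_{(j)} \le \|h_{T_0^c}\|_{\ell_1}/j$; whence

$$ \|h_{T_{01}^c}\|_{\ell_2}^2 \le \|h_{T_0^c}\|_{\ell_1}^2 \sum_{j \ge k'+1} j^{-2} \le \|h_{T_0^c}\|_{\ell_1}^2/k'. $$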

The lemma is proven. 
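
The two counting arguments of this proof, the block inequality (7.5) and the tail bound used for the second part, are easy to test on synthetic data. In the sketch below the list `h_outside` plays the role of the entries of $h$ outside $T_0$; its length, its decay profile, the block size and the seed are hypothetical choices made only for illustration.

```python
import math
import random

def sorted_blocks(values, k_prime):
    """Sort magnitudes decreasingly and split them into blocks T_1, T_2, ... of size k'."""
    mags = sorted((abs(x) for x in values), reverse=True)
    return [mags[i:i + k_prime] for i in range(0, len(mags), k_prime)]

rng = random.Random(3)
h_outside = [rng.gauss(0.0, 1.0) / (1 + i) for i in range(200)]   # entries of h off T_0
k_prime = 7
blocks = sorted_blocks(h_outside, k_prime)

l1_total = sum(abs(x) for x in h_outside)                         # ||h||_{l1(T_0^c)}

# (7.5): sum over j >= 2 of ||h_{T_j}||_2 versus ||h||_{l1(T_0^c)} / sqrt(k').
lhs_75 = sum(math.sqrt(sum(x * x for x in b)) for b in blocks[1:])
rhs_75 = l1_total / math.sqrt(k_prime)

# Second part: ||h_{T_01^c}||_2^2 versus ||h_{T_0^c}||_1^2 / k'.
tail_sq = sum(x * x for b in blocks[1:] for x in b)
rhs_tail = l1_total ** 2 / k_prime

print(lhs_75, rhs_75)
print(tail_sq, rhs_tail)
```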